Project Template¶

The following template repeats the essential steps for scraping, tokenizing, and mapping sentiment data. Each major step is further broken down into smaller steps. You can step through most of the code.

  • You will have to create your own dataframe names
  • Customize the filters
  • Customize the maps

Note¶

When your individual input is required the following symbol will appear:

You should save your results to a pickle file after each major step. These points are indicated by the save icon:

They are also titled **Export Pickle** You will need to import these results into the `project_presentation_template`

2 Data Input and Wrangling¶

Introduction¶

In this step, you will download a CSV file from Gutenberg, modify it, and save it locally for further analysis.

Process Steps:¶

  • Load a CSV file from a remote source
  • Save the file locally

Step-by-Step Guide¶

Step 1: Import the Required Libraries¶

The primary library for handling CSV data in Python is pandas. Make sure it’s imported at the beginning of your script.

import pandas as pd
In [9]:
import pandas as pd

Step 2: Import Pickle File¶

In lesson 2_1 we created a clean pickle file of the catalog, we can simply start there rather than running through all those steps.

In [11]:
pg_catalog_clean=pd.read_pickle('pg_catalog_clean.pickle')

Step 3: Create a custom dataframe¶

Overview¶

In lesson 2_2 we learned how to filter dataframes. For this part of the project, you are going to create your own pg_catalog dataframe that includes the corpus of works you want to analyze. Your corpus can be as large as you want, but keep in mind that having a lot of text can significantly increase your processing time.

Requirements¶

Please follow the following parameters when creating our custom dataframe:

  1. Unique descriptive dataframe name (i.e. not df_virginia_history)

  2. Corpus should have a logical coherence.

  3. You should aim for at least 25 texts

  4. Save the resulting dataframe as a pickle file

Note¶

Remember to make a deep copy of your dataframe by using the method .copy()

Example¶
df_virginia_history = df_pg_catalog[
    (df_pg_catalog.language == 'en') & 
    (df_pg_catalog.type == 'Text') 
].copy()

Save the dataframe to a pickle file:

df_virginia_history.to_pickle('custom_file.pickle')

3 Scraping Gutenberg¶

Overview¶

In the previous section you were meant to create a custom dataframe. Here you are going to scrape that data frame. The scraping function assumes that your first column is called text_id. If that is not the case then something went wrong with the filtering process.

Load gutenberg_scraper¶

Since you only need to perform one function fetch_text_data() all of the other logic has been tucked away in a .py file. You can load the python file and function in like any other import as long as the file is present in your root directory.

In [22]:
from gutenberg_scraper import fetch_text_data
In [23]:
#This line is here to let you walk through the code. It takes a random sample from the Project Gutenberg Catalog and runs it through all the steps.
YOUR_DATAFRAME = pg_catalog_clean.sample(10)
In [24]:
fetch_text_data(YOUR_DATAFRAME)
100%|██████████| 10/10 [00:08<00:00,  1.16it/s]
Out[24]:
text_id type issued title language subjects locc bookshelves second_author last_name first_name birth death text_data
42709 42783 Text 2013-05-24 Notes and Queries, Number 168, January 15, 185... en Questions and answers -- Periodicals AG Notes and Queries; Browsing: Culture/Civilizat... Bell, George, 1814-1890 [Editor] Various <NA> <NA> <NA> NOTES AND QUERIES, NUMBER 168, JANUARY 15, 185...
29725 29797 Text 2009-08-25 The Prison Chaplaincy, And Its Experiences en Prisons -- New Hampshire; Prisoners -- Religio... HV Browsing: History - American; Browsing: Religi... <NA> Quinby Hosea 1804 1878 THE PRISON CHAPLAINCY, AND ITS EXPERIENCES ***...
60350 60427 Text 2019-10-05 Introducing Irony: A Book of Poetic Short Stor... en American poetry -- 20th century; Short stories... PS Browsing: Literature; Browsing: Poetry; Browsi... <NA> Bodenheim Maxwell 1893 1954 INTRODUCING IRONY: A BOOK OF POETIC SHORT STOR...
68907 69000 Text 2022-09-16 The private life of Henry Maitland: A record d... en Gissing, George, 1857-1903 PR Browsing: Biographies; Browsing: Literature <NA> Roberts Morley 1857 1942 THE PRIVATE LIFE OF HENRY MAITLAND: A RECORD D...
43403 43477 Text 2013-08-15 Bell's Cathedrals: The Cathedral Church of Lin... en Lincoln Cathedral <NA> Browsing: Architecture; Browsing: Art & Photog... <NA> Kendrick A. F. (Albert Frank) 1872 1954 BELL'S CATHEDRALS: THE CATHEDRAL CHURCH OF LIN...
40084 40158 Text 2012-07-07 Manners: A Novel, Vol 1 en England -- Social life and customs -- Fiction PR Browsing: Culture/Civilization/Society; Browsi... <NA> Panache Madame <NA> <NA> MANNERS: A NOVEL, VOL 1 ***\r\n\r\n\r\n\r\n\r\...
33998 34071 Text 2010-10-15 The Evolution of an Empire: A Brief Historical... en France -- History DC Browsing: History - European; Browsing: Histor... <NA> Parmele Mary Platt 1843 1911 ***
4898 4939 Text 2004-01-01 Chopin : the Man and His Music en Chopin, Frédéric, 1810-1849; Composers -- Biog... ML Browsing: Biographies; Browsing: Music <NA> Huneker James 1857 1921 ***
477 496 Text 2006-01-16 The Little Lame Prince en Children's literature; Fairy tales; Princes --... PZ Children's Literature; Browsing: Children & Yo... <NA> Craik Dinah Maria Mulock 1826 1887 THE LITTLE LAME PRINCE ***\r\n\r\n\r\n\r\n\r\n...
37937 38011 Text 2011-11-14 The Epic of Hades, in Three Books en Gods, Greek -- Poetry; Mythology, Greek -- Poetry PR Browsing: Culture/Civilization/Society; Browsi... <NA> Morris Lewis 1833 1907 THE EPIC OF HADES, IN THREE BOOKS ***\r\n\r\n\...

Export Pickle 1¶

Export the file as a pickle file for presentation.

YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_TEXTS.pickle')
In [26]:
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_TEXTS.pickle')

4 Clean DataFrame for Analysis¶

To prepare our data for analysis we will:

  • Split it into sentences
  • Clean the individual sentences
  • Drop unnecessary data

For the code below you will have to replace YOUR_DATAFRAME with the name of your dataframe.

Import NLTK¶

We can use NLTK for some basic preprocessing.

import nltk
import re
In [33]:
import nltk
import re

Step 1: Tokenize Text into Sentences¶

In [35]:
# Explodes the DataFrame so that each row corresponds to a single sentence
YOUR_DATAFRAME = YOUR_DATAFRAME.assign(
    sentences=YOUR_DATAFRAME['text_data'].apply(nltk.sent_tokenize)
).explode('sentences')

Step 2: Remove the 'text_data' column¶

In [37]:
YOUR_DATAFRAME = YOUR_DATAFRAME.drop(columns='text_data')

Step 3: Define a Cleaning Function for Sentences¶

In [39]:
def clean_sentence(sentence):
    # 1. Remove text inside square brackets
    sentence = re.sub(r'\[.*?\]', '', sentence)
    # 2. Remove unwanted punctuation but retain sentence-ending punctuation
    sentence = re.sub(r'[^\w\s,.!?\'"‘’“”`]', '', sentence)
    # 3. Remove newline and carriage return characters, and underscores
    sentence = sentence.replace('\n', ' ').replace('\r', ' ').replace('_', '')
    # 4. Return an empty string for all-uppercase sentences (likely headers or TOC entries)
    return '' if sentence.isupper() else sentence

Step 4: Apply Cleaning and Remove Empty Sentences¶

In [41]:
# Apply the cleaning function, then filter out any sentences that are empty strings
YOUR_DATAFRAME['cleaned_sentences'] = YOUR_DATAFRAME['sentences'].apply(clean_sentence)
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME['sentences'] != '']

Step 5: Reset Index for the Cleaned DataFrame¶

In [43]:
YOUR_DATAFRAME = YOUR_DATAFRAME.reset_index(drop=True)

Step 6: (OPTIONAL) Save deep copy of dataframe and pickle¶

In [45]:
#YOUR_DATAFRAME_DEEP = YOUR_DATAFRAME.copy()
#YOUR_DATAFRAME_DEEP.to_pickle('YOUR_DATAFRAME_DEEP.pickle')

5 Perform Initial Tokenization¶

Overview¶

Since the geoparsing process is quite intense, we can actually reduce our processing overhead a bit by eliminating those sentences that likely don't have toponyms. We can do so by first running a pass with the lightweight en_core_web_sm spacy library

Load Spacy¶

We are going to load spacy and the small library at the same time.

In [50]:
import spacy
from tqdm import tqdm
tqdm.pandas()
# Load spaCy's English model
nlp = spacy.load('en_core_web_sm')
/home/joostburgers/anaconda3/lib/python3.12/site-packages/torch/cuda/__init__.py:654: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

Load Functions into memory¶

In [52]:
# Function to extract GPE (Geopolitical Entities) from a batch of docs
def extract_gpe_from_docs(docs):
    return [[ent.text for ent in doc.ents if ent.label_ == 'GPE'] or None for doc in docs]

# Use nlp.pipe() for faster batch processing with multiple cores
def process_sentences_in_batches(sentences, batch_size=50, n_process=-1):
    # Process sentences using nlp.pipe with batch processing and multi-processing
    gpe_results = []
    for doc in tqdm(nlp.pipe(sentences, batch_size=batch_size, n_process=n_process), total=len(sentences)):
        gpes = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
        gpe_results.append(gpes if gpes else None)
    return gpe_results

Process your DataFrame¶

In [54]:
YOUR_DATAFRAME['toponyms'] = process_sentences_in_batches(YOUR_DATAFRAME['cleaned_sentences'])
(YOUR_DATAFRAME['cleaned_sentences'])
100%|██████████| 14564/14564 [01:40<00:00, 145.05it/s]
Out[54]:
0        NOTES AND QUERIES, NUMBER 168, JANUARY 15, 185...
1                            "When found, make a note of."
2                                                         
3                                                      No.
4                                                     168.
                               ...                        
14559                                See the Orphic Hymns.
14560                                                     
14561    Transcriber's Notes    This text is hemistichi...
14562          Inconsistent Hyphenation and text retained.
14563                                                     
Name: cleaned_sentences, Length: 14564, dtype: object

Clean up the result¶

As we saw in the most intense part of the extraction process in lesson_5 we want to reduce the number of sentences being processed to lower the computation time. We can do two things at this stage.

  1. Eliminate unnecessary columns
  2. Eliminate all sentences for which there is no result
  3. Eliminate all sentences with very few results. Your group can decide on the threshold, but suffice to say that all toponyms with a count of 1 won't be relevant. You can adjust this number as you fine-tune your model.

Eliminate Unncessary Columns¶

In [58]:
YOUR_DATAFRAME = YOUR_DATAFRAME.drop(columns=['language', 'issued', 'type', 'locc', 'bookshelves', 'second_author']).copy()

Eliminate None¶

In [60]:
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME.toponyms.notna()]

To eliminate the some of the complicated processing the function below adds a count column to the dataframe.

In [62]:
def add_toponym_count(df, toponym_col='toponyms', sentence_col='cleaned_sentences'):
    """
    Processes the DataFrame to count toponyms and aggregate back to sentences, keeping all original columns.
    
    Args:
        df (pd.DataFrame): The DataFrame containing toponyms and sentences.
        toponym_col (str): Column containing the toponyms as lists.
        sentence_col (str): Column containing the cleaned sentences.
    
    Returns:
        pd.DataFrame: A DataFrame grouped by sentences with a list of toponyms, their counts, and all original columns.
    """
    
    # Step 1: Explode the 'toponyms' column
    exploded_df = df.explode(toponym_col)
    
    # Step 2: Group by 'toponyms' to count occurrences and add 'nltk_toponym_count' column
    toponym_counts = exploded_df.groupby(toponym_col).size().reset_index(name='nltk_toponym_count')
    
    # Step 3: Merge the counts back to the exploded DataFrame
    exploded_df = exploded_df.merge(toponym_counts, on=toponym_col, how='left')
    
    # Step 4: Group by 'cleaned_sentences' and aggregate all columns
    # Use 'first' to retain the first non-null value for each original column, and 'list' for the toponym_col
    aggregation_dict = {col: 'first' for col in df.columns if col not in [sentence_col, toponym_col]}
    aggregation_dict[toponym_col] = lambda x: list(x)  # Aggregate toponyms into lists
    aggregation_dict['nltk_toponym_count'] = 'first'   # Take the first count (all counts are the same within groups)
    
    result_df = exploded_df.groupby(sentence_col).agg(aggregation_dict).reset_index()
    
    return result_df
In [63]:
YOUR_DATAFRAME = add_toponym_count(YOUR_DATAFRAME)

Filter out low toponym counts¶

Your dataframe now has the new variable nltk_toponym_count. You can filter out low count results to get fewer sentences. You can get a data frame for all cleaned sentences where the nltk_toponym_count is greater than 1.

In [67]:
#Remove toponyms with a low count

(Optional) Save pickle of tokenization¶

In [70]:
#YOUR_DATAFRAME_TOPONYM.to_pickle('YOUR_DATAFRAME_TOPONYM.pickle')

6 Geoparsing (Deep Scan)¶

Overview¶

Since the deep scan for toponyms will likely reduce the size of the dataframe again, we can backload the sentiment analysis as the last step to ensure we don't process data unnecessarily.

In [74]:
from geoparser import Geoparser
from tqdm.notebook import tqdm

Because there are some compatibility issues with the geoparser package, there are pesky warnings that pop-up. These do not affect the output, but they are annoying. The line below filters these out of the console.

In [76]:
import warnings

# Suppress all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Load Geoparser¶

To use Geoparser, instantiate an object of the Geoparser class with optional specifications for the spaCy model, transformer model, and gazetteer. By default, the library uses an accuracy-optimised configuration:

In [79]:
geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1', gazetteer='geonames')

Load in the geoparse_column function to simplify the toponym recognition process.

In [81]:
def geoparse_column(df):
    sentences = df['cleaned_sentences'].tolist()  # Convert column to list
    docs = geo.parse(sentences, feature_filter=['A', 'P'])  # Run geo.parse on the entire list

    # Initialize lists to store the extracted fields
    places, latitudes, longitudes, feature_names = [], [], [], []

    # Iterate through the results and extract toponyms and their locations
    for doc in docs:
        doc_places = []
        doc_latitudes = []
        doc_longitudes = []
        doc_feature_names = []

        for toponym in doc.toponyms:
            if toponym.location:
                doc_places.append(toponym.location.get('name'))
                doc_latitudes.append(toponym.location.get('latitude'))
                doc_longitudes.append(toponym.location.get('longitude'))
                doc_feature_names.append(toponym.location.get('feature_name'))
            else:
                doc_places.append(None)
                doc_latitudes.append(None)
                doc_longitudes.append(None)
                doc_feature_names.append(None)

        # Append the extracted data for the document
        places.append(doc_places)
        latitudes.append(doc_latitudes)
        longitudes.append(doc_longitudes)
        feature_names.append(doc_feature_names)

    # Assign the extracted data to the DataFrame as new columns
    df['place'] = places
    df['latitude'] = latitudes
    df['longitude'] = longitudes
    df['feature_name'] = feature_names

    return df
In [82]:
geoparse_column(YOUR_DATAFRAME)
Toponym Recognition...
Batches:   0%|          | 0/1388 [00:00<?, ?it/s]
Toponym Resolution...
Batches:   0%|          | 0/873 [00:00<?, ?it/s]
Batches:   0%|          | 0/158 [00:00<?, ?it/s]
Out[82]:
cleaned_sentences text_id title subjects last_name first_name birth death sentences toponyms nltk_toponym_count place latitude longitude feature_name
0 76 Now re... 42783 Notes and Queries, Number 168, January 15, 185... Questions and answers -- Periodicals Various <NA> <NA> <NA> * * * * *\r\n\r\n\r\n{... [pp] 4 [] [] [] []
1 ARCHER'S PHO... 42783 Notes and Queries, Number 168, January 15, 185... Questions and answers -- Periodicals Various <NA> <NA> <NA> * * * * *\r\n\r\n\r\nA... [MR] 3 [] [] [] []
2 TO PHOTOGRAP... 42783 Notes and Queries, Number 168, January 15, 185... Questions and answers -- Periodicals Various <NA> <NA> <NA> * * * * *\r\n\r\n\r\nT... [Brébisson] 1 [] [] [] []
3 If you have received the various lett... 40158 Manners: A Novel, Vol 1 England -- Social life and customs -- Fiction Panache Madame <NA> <NA> ----\r\n\r\n If you have received the vari... [England] 109 [England, Rhodes, Algiers] [52.16045, 53.54368, 36.73225] [-0.70312, -2.22551, 3.08746] [first-order administrative division, populate...
4 George Drew, Esq. 42783 Notes and Queries, Number 168, January 15, 185... Questions and answers -- Periodicals Various <NA> <NA> <NA> ;\r\n George Drew, Esq. [Esq] 22 [] [] [] []
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1383 the old world Is full of Hate. 38011 The Epic of Hades, in Three Books Gods, Greek -- Poetry; Mythology, Greek -- Poetry Morris Lewis 1833 1907 the old world\r\nIs full of Hate. [Hate] 1 [] [] [] []
1384 vi., pp. 42783 Notes and Queries, Number 168, January 15, 185... Questions and answers -- Periodicals Various <NA> <NA> <NA> vi., pp. [pp] 4 [] [] [] []
1385 was written, one of these fabrics of the olden... 42783 Notes and Queries, Number 168, January 15, 185... Questions and answers -- Periodicals Various <NA> <NA> <NA> was written, one of these fabrics of the olden... [Kington] 1 [] [] [] []
1386 what fine pearls you have got onyour mother's... 40158 Manners: A Novel, Vol 1 England -- Social life and customs -- Fiction Panache Madame <NA> <NA> what\r\nfine pearls you have got on--your _mot... [suppose, Miss.] 1 [] [] [] []
1387 xi., ver. 43477 Bell's Cathedrals: The Cathedral Church of Lin... Lincoln Cathedral Kendrick A. F. (Albert Frank) 1872 1954 xi., ver. [xi.] 1 [Budapest XI.] [47.475] [19.03605] [second-order administrative division]

1388 rows × 15 columns

Export Pickle 2¶

As the geoparsing process takes a long time, you should store it right after the result. You will also import these results for your project_presentation_template

YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_PLACES.pickle')
In [86]:
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_PLACES.pickle')

Clean up the resulting dataframe¶

As with the previous instance of toponym resolution, there will be some rows that do not contain relevant information. This will slow down the sentiment analysis.

  1. Eliminate empty results
In [89]:
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME['place'].str.len() != 0].copy()

7 Run Sentiment Analysis¶

Overview¶

We will now implement the sentiment analysis on the remaining sentences.

Read step through and read all the prerequisites into memory.

In [94]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
from scipy.special import softmax
from typing import Dict, Any
In [95]:
# Initialize RoBERTa. There will probably be a warning. You can ignore this.
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
config.json:   0%|          | 0.00/747 [00:00<?, ?B/s]
vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]
merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]
pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]
In [96]:
# Function to calculate RoBERTa sentiment scores
def polarity_scores_roberta(text: str) -> Dict[str, float]:
    """
    Calculate RoBERTa sentiment scores for a given text.
    
    Args:
    - text: The text to analyze
    
    Returns:
    - A dictionary with sentiment scores for negative, neutral, and positive sentiment
    """
    # Tokenize and truncate to max length (512 tokens)
    encoded_text = tokenizer.encode_plus(
        text, 
        max_length=512, 
        truncation=True, 
        return_tensors='pt'
    )
    
    # Get model output and convert to probabilities
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
In [97]:
# Function to attach sentiment analysis to a specific column in the dataframe
def add_sentiment_to_column(
    df: pd.DataFrame, column_name: str, num_rows: int = None
) -> pd.DataFrame:
    """
    Adds RoBERTa sentiment analysis to a specified column in a dataframe.
    
    Args:
    - df: The dataframe to process
    - column_name: The name of the column containing the text to analyze
    - num_rows: The number of rows to process (default: 500)
    
    Returns:
    - df: A dataframe with added sentiment analysis columns
    """
        # If num_rows is specified, limit the dataframe, otherwise process all rows
    if num_rows:
        df_subset = df.head(num_rows).reset_index(drop=True)
    else:
        df_subset = df.reset_index(drop=True)  # Process all rows and reset the index
    
    # Function to process each row and add sentiment analysis
    def process_row(text: str) -> Dict[str, Any]:
        try:
            return polarity_scores_roberta(text)
        except Exception as e:
            print(f"Error processing text: {text}. Error: {e}")
            return {'roberta_neg': None, 'roberta_neu': None, 'roberta_pos': None}
    
    # Apply the RoBERTa sentiment analysis to each row
    tqdm.pandas(desc="Processing Sentiment Analysis")
    sentiment_scores = df_subset[column_name].progress_apply(process_row)
    
    # Convert the resulting list of dictionaries into a DataFrame and concatenate it with the original subset
    sentiment_df = pd.DataFrame(sentiment_scores.tolist())
    df_subset = pd.concat([df_subset, sentiment_df], axis=1)
    
    return df_subset
In [98]:
YOUR_DATAFRAME = add_sentiment_to_column(YOUR_DATAFRAME, 'cleaned_sentences')
Processing Sentiment Analysis: 100%|██████████| 788/788 [01:04<00:00, 12.27it/s]

Create an aggregate score¶

Since the roberta score is positive, negative, and neutral, we will have to consolidate it into one easier to understand score. We will take the difference between positive and negative, and multiply it by the percentage of neutral. This way if a score is very neutral it will even out the difference between positive and negative.

In [101]:
# Calculate the compound score and add it as a new column 'roberta_compound'
YOUR_DATAFRAME['roberta_compound'] = (
    YOUR_DATAFRAME['roberta_pos'] - YOUR_DATAFRAME['roberta_neg']
) * (1 - YOUR_DATAFRAME['roberta_neu'])

Explode, filter, and aggregate¶

At the moment, there are places and sentiments, but since some of the sentences contain multiple places these need to be unnested.

In [104]:
YOUR_DATAFRAME = YOUR_DATAFRAME.explode(['place', 'latitude', 'longitude', 'feature_name'])

Remove empty values

In [106]:
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME.place.notna()]

Aggregate the data

In [108]:
YOUR_DATAFRAME = YOUR_DATAFRAME.groupby('place').agg(
    location_count=('place', 'size'),  # Count occurrences of each location
    latitude=('latitude', 'first'),    # Take the first latitude (you can also use 'mean')
    longitude=('longitude', 'first'),  # Take the first longitude (or 'mean')
    location=('feature_name','first'),
    avg_roberta_pos=('roberta_pos', 'mean'),  # Average of roberta_pos
    avg_roberta_neu=('roberta_neu', 'mean'),  # Average of roberta_neu
    avg_roberta_neg=('roberta_neg', 'mean'), # Average of roberta_neg
    avg_roberta_compound =('roberta_compound','mean')
).reset_index()

Create Histogram of Count Values (Optional)¶

To get a sense of how the data is distributed and to decide which data to include, you can create a histogram of the location_count column fairly easily.

In [111]:
import matplotlib.pyplot as plt
# You might need to install matplotlib with 
# pip install matplotlib

YOUR_DATAFRAME.location_count.plot.hist(bins=10, alpha=0.7)
Out[111]:
<Axes: ylabel='Frequency'>
No description has been provided for this image

Generally, the data will be very left skewed. You might want to filter out some of the lower values.

Filter out low counts¶

As very low counts will not show up on the map anyway, filter them out here. No code has been provided, but the procedure is essentially the same as before.

In [114]:
#code here

Bucket Data¶

As we saw in lesson_5, the distribution of the data is tricky. We can solve this by bucketing it along the lines of Jenks Natural Breaks.

In [117]:
import mapclassify as mc #you may get an error. If so install mapclassify with pip install mapclassify

jenks_breaks = mc.NaturalBreaks(y=YOUR_DATAFRAME['location_count'], k=5)
YOUR_DATAFRAME.loc[:,'location_count_bucket'] = jenks_breaks.find_bin(YOUR_DATAFRAME['location_count'])+1

Export Pickle 3¶

This is the final export of the file for the project_presentation_template

YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_SENTIMENTS.pickle')
In [119]:
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_SENTIMENTS.pickle')

Map your Data¶

Overview¶

This is the core of the project. Use the stub below to map your data and then customize the map. I have deliberately set some of the values very poorly to encourage you to work on your own map!

In [123]:
import plotly.express as px

fig = px.scatter_mapbox(
    YOUR_DATAFRAME,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count_bucket",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.Twilight[::-1],  # Use Twilight scale (blue to red)
    size_max=30,                  # Maximum size of the bubbles
    center={"lat": 48, "lon": 2},
    zoom=6                       # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="open-street-map",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)



fig.show()

Happy mapping!

In [ ]: